Exploiting Coreference Annotations for Text-to-Hypertext Conversion

Authors

  • Anke Holler
  • Jan Frederik Maas
  • Angelika Storrer
Abstract

The paper describes an annotation scheme for coreference developed within the application context of text-to-hypertext conversion. In this context, coreference is used (1) for generating document-internal and cross-document hyperlinks, and (2) for resolving anaphoric expressions in order to achieve cohesive closedness in hypertext nodes. We will argue that for the purpose of cross-document linking it is necessary to separate the annotation of coreference relations from the annotation of anaphoric relations. To account for this requirement, we developed a knowledge-based annotation scheme that relates referential expressions in the text to entities in a knowledge representation, which is modeled using XML Topic Maps.

1. Project Framework

Converting linear text documents into documents that can be published in a hypertext environment is a complex task requiring conversion software on the technical side as well as conversion strategies and methods on the conceptual side. In the project HyTex[1], which is the framework of the approach discussed in this paper, we concentrate on principles and strategies for handling conceptual problems of text-to-hypertext conversion such as:

  • Segmentation: What are the criteria for segmenting documents into text segments to be used as hypertext nodes?
  • Reorganization: What are the guidelines for generating "cohesive closedness" in hypertext nodes, i.e. what kinds of transformations are necessary to unchain text segments from their linkage to the reading path of the sequential document, so that they may be integrated into different user-selected pathways?
  • Linking: What are the guidelines and principles for reconnecting the nodes via hyperlinks?

Using XML as the technical basis for hypertext modeling and viewing, the project develops strategies and methods which (semi-)automatically create hypertext layers and views based on text-grammatical annotations.
By storing the hypertext as additional document layers, our approach preserves the structure and content of the original text documents, and thus provides the reader with the choice between sequential and selective reading modes. The general aim of the project is to support selective hypertext readers in finding coherent pathways through the document network and thus make selective reading and browsing more efficient and more convenient than would be possible with print media. The feasibility and performance of the methodology are tested and evaluated using a German text corpus, which comprises documents that deal with two subject domains, namely "text technology" and "hypertext research" (Lenz & Storrer, 2002).

The central idea of the conversion approach in HyTex is to base strategies for segmentation, reorganization and linking on information coming from two levels:

  • On the document level, we explicitly mark up the text-grammatical structures and relations between text segments, e.g. coreference relations, semantics of connectives, text-deictic expressions, and expressions indicating topic handling.
  • On the domain knowledge level, we represent the main concepts of this subject domain and their interrelations, using the WordNet model (Fellbaum, 1998) as the conceptual and XML Topic Maps (XTM, 2001) as the technical basis (Beißwenger & Storrer & Runte, 2004; Lenz & Birkenhage & Maas, 2004).

A dynamic-adaptive component that processes logs of usage has been considered but has not been put into practice during the current phase of the project. In a later stage, this document usage level would supply information about

[1] The acronym "HyTex" is spelt out as Hypertextualisierung auf textgrammatischer Grundlage ('Hypertext conversion on a textgrammatical basis'). The project was launched in November 2001 as part of the research group Texttechnologische Informationsmodellierung ('Text-technological information modelling'), cf. http://www.text-technology.de. For more information on the HyTex project see http://www.hytex.info.
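The idea stated in the abstract, that referential expressions are related to entities in a knowledge representation and that this relation can drive cross-document linking, can be illustrated with a minimal sketch. It is not the HyTex implementation: the `Mention` record, its fields, the `topic_id` values, and the `cross_document_links` function are hypothetical names chosen for illustration, and no actual XML Topic Maps parsing is performed. The sketch only assumes that each annotated expression carries the identifier of the topic it refers to.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A referential expression annotated in a text segment (hypothetical record)."""
    doc_id: str    # document containing the expression
    node_id: str   # hypertext node (text segment) containing the expression
    text: str      # surface form of the expression
    topic_id: str  # assumed identifier of the entity in the knowledge representation


def cross_document_links(mentions):
    """Link nodes in different documents whose mentions refer to the same topic.

    Returns a sorted list of (source_doc, source_node, target_doc, target_node,
    topic_id) tuples. Mentions within the same document are not linked here.
    """
    by_topic = defaultdict(list)
    for mention in mentions:
        by_topic[mention.topic_id].append(mention)

    links = set()
    for topic_id, group in by_topic.items():
        for source in group:
            for target in group:
                if source.doc_id != target.doc_id:
                    links.add((source.doc_id, source.node_id,
                               target.doc_id, target.node_id, topic_id))
    return sorted(links)


if __name__ == "__main__":
    example = [
        Mention("d1", "n1", "hypertext", "t-hypertext"),
        Mention("d2", "n4", "Hypertexte", "t-hypertext"),
        Mention("d1", "n2", "link", "t-link"),
    ]
    for link in cross_document_links(example):
        print(link)
```

Because the link is computed from the shared topic identifier rather than from an antecedent inside the same text, two expressions can be connected even when they occur in different documents. Anaphoric relations, which the abstract argues should be annotated separately, are deliberately left out of this sketch.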

Similar resources

Knowledge-lean projection of coreference chains across languages

Common technologies for automatic coreference resolution require either a language-specific rule set or large collections of manually annotated data, which is typically limited to newswire texts in major languages. This makes it difficult to develop coreference resolvers for a large number of the so-called low-resourced languages. We apply a direct projection algorithm on a multi-genre and mult...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Converting a Corpus into a Hypertext: An Approach Using XML Topic Maps and XSLT

In the context of the HyTex project, our goal is to convert a corpus into a hypertext, basing conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments. Domain-specific knowledge is represented in the form of a knowledge net, using topic maps. We use XML as an interchange format. In this paper, we focus on a declarative rul...

Coreference resolution with syntactico-semantic rules and corpus statistics

A new hybrid approach to the coreference resolution problem is presented. The CORUDIS system (COreference RUles with DIsambiguation Statistics) combines syntactico-semantic rules with statistics derived from an annotated corpus. First, the rules and corpus annotations are described and exemplified. Then, the coreference resolution algorithm and the involved statistics are explained. Finally, th...

What Is Coreference, And What Should Coreference Annotation Be?

In this paper, it is argued that 'coreference annotation', as currently performed in the MUC community, goes well beyond annotation of the relation of coreference as it is commonly understood. As a result, it is not always clear what semantic relation these annotations are actually encoding. The paper discusses a number of interrelated problems with coreference annotation and concludes that re...

Publication date: 2004